NOTE: After analysing the data in Excel, I found that columns such as 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about' and 'skewness_about.1' have a very wide range, which could mean a mixture of Gaussians or outliers. The data also has lots of missing values.
EDA: This step includes descriptive analysis, multivariate analysis (pair plots and box plots), treating outliers and treating missing values. I also removed 'scaled_radius_of_gyration' for the reason stated further down.
Model building: I built an SVC on the scaled data only.
PCA: In this step I first selected only the features on which PCA is applicable, which were all columns except scaled_radius_of_gyration.1, skewness_about, skewness_about.1, skewness_about.2 and hollows_ratio. Then I found the best n_components, i.e. the optimum number of principal components explaining more than 95% of the variance, using the elbow method. Then I combined these components with the left-out columns, built an SVC on the combined dataset and evaluated the model.
LDA: In this step I likewise first selected only the features on which LDA is applicable, which were all columns except scaled_radius_of_gyration.1, skewness_about, skewness_about.1, skewness_about.2 and hollows_ratio. Then I applied LDA to the remaining columns and reduced the features to 2 components. Then I combined these components with the left-out columns, built an SVC on the combined dataset and evaluated the model.
KPCA: In this step I transformed all features X into kernel principal components with the 'rbf' kernel, built an SVC model and evaluated it.
Grid search and cross validation: Here I found the optimum hyper-parameters and ran cross validation with them.
Conclusion: This is where I summarised everything I did and concluded whether to use PCA, LDA, KPCA or none of them.
Note: In steps 3 and 4, i.e. PCA and LDA, I split the features because PCA and LDA should only be applied to columns that show linear correlation with other columns.
I treated NaN values with the median of the corresponding column, and I also replaced outliers with the median: the number of outliers is small, so converting them to the median barely changes the dataset. Removing the outliers instead gave me slightly higher accuracy, but it reduced the data from 849 rows to 801; that may not seem like much, but the dataset is already small, so shrinking it further did not seem like a good option. More data generally means higher quality.
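The treatment described above (median imputation plus IQR-based outlier replacement) can be sketched on a toy column; `col` and its values here are illustrative, not from the dataset:

```python
import numpy as np
import pandas as pd

def replace_outliers_with_median(s, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.mask((s < q1 - k * iqr) | (s > q3 + k * iqr), s.median())

# toy column with one NaN and one obvious outlier
col = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 500.0, np.nan])
col = col.fillna(col.median())            # NaN -> column median
col = replace_outliers_with_median(col)   # outlier -> column median
```

In the notebook below the same idea is applied column by column to the real dataframe.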
#importing Libs
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,precision_score,recall_score,f1_score
from sklearn import metrics
from matplotlib.colors import ListedColormap
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
sns.set_style('whitegrid')
%matplotlib inline
#importing data into df
df=pd.read_csv("vehicle-2.csv")
#checking the data dimensions
df.shape
#checking the data if its loaded properly
df.head()
df.describe().T
#After seeing 'count' we can conclude that it has missing values
#Column such as 'distance_circularity','radius_ratio','max.length_aspect_ratio','scatter_ratio','elongatedness','max.length_rectangularity','scaled_variance','scaled_variance.1','skewness_about.1' have skew in them.
col=df.iloc[:,:-1].columns
print("difference between mean and median")
for i in col:
    k = df[i].mean() - df[i].quantile(0.50)
    k = int(k)
    print(i, " = ", k)
print("\n\n")
print("To see right or left side skew")
for i in col:
    k = df[i].quantile(0.50) - df[i].quantile(0.75)
    k2 = df[i].quantile(0.25) - df[i].quantile(0.50)
    print(i, " = ", abs(k2), " ", abs(k))
print("\n")
#describe() and the code above show that these columns have skewness:
#right side Skew : 'skewness_about.1', 'scaled_variance.1', 'scaled_variance', 'max.length_rectangularity', 'pr.axis_rectangularity', 'scatter_ratio', 'radius_ratio', 'distance_circularity'
#Left side SKew: 'hollows_ratio', 'elongatedness'
df.info() #It is evident that data has missing values
#Also every feature is of int or float type
#To see datatypes of each column
df.dtypes
#To check number of missing value in each column
print("The output will be the sum of missing values columnwise\n\n")
print(df.isnull().sum())
#To see any missing values
df[df.isnull().any(axis=1)]
#As you can see it has missing values
#Calculating the median of each column
df.median()
#Treating NaN values with median of that column using Imputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer = imputer.fit(df.iloc[:,:-1])
imputed_data = imputer.transform(df.iloc[:,:-1].values)
df.iloc[:,:-1] = imputed_data
df.isnull().sum() #here it is evident that there is no missing values
#This step is double check for missing values
#This just shows the replaced median values for 4 columns
print(df.iloc[[5,105,118],[1]])
print(df.iloc[[35,118,207],[2]])
print(df.iloc[[9,78,159],[3]])
print(df.iloc[[19,222],[4]])
#Note:I have checked for each column Here I am not showing as it will take space
Using the imputer I replaced missing values with the median of each column; I have also checked each replaced value.
# Let us check whether any of the columns has any value other than numeric i.e. data is not corrupted such as a "?" instead of
# a number.
# we use np.isreal a numpy function which checks each column for each row and returns a bool array,
# where True if input element is real.
# applymap is pandas dataframe function that applies the np.isreal function columnwise
# Following line selects those rows which have some non-numeric value in any of the columns hence the ~ symbol
df[~df.applymap(np.isreal).all(1)]
#converting class values to numerical.
df['class'] = df['class'].replace({'car': 1, 'van': 2, 'bus': 3})
#Note: This step is not necessary but its better to convert so to perform many operations.
df['class'].unique()
I plotted pairplots to inspect mixtures of Gaussians, outliers, and the relationships of features with each other and with the target column, as well as the distributions.
sns.pairplot(df, diag_kind='kde');
The pairplots make it clear that many independent columns depend on each other. The number of features is also large, so we may suffer from the curse of dimensionality; a reduction technique such as PCA creates new features that are independent of each other and reduces the number of columns.
Also, columns such as 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about' and 'skewness_about.1' have outliers.
Some columns also show a mixture of Gaussians, such as 'circularity', 'scatter_ratio', 'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1', 'skewness_about.2' and 'hollows_ratio'.
col=df.drop(labels='class',axis=1)
col=col.columns
print (col)
#To check outliers in numerical column
k=1
plt.figure(figsize=(40,40))
for i in col:
    plt.subplot(10,2,k)
    sns.boxplot(df[i], width=3)
    plt.title(i)
    k = k + 1
plt.show()
#As the boxplots show, columns such as 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about' and 'skewness_about.1' have outliers
k=['radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']
print(len(k))
#To see the number of outliers in each column
k=['radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']
print("NUMBER OF OUTLIERS")
for i in k:
    out1 = 1.5 * (df[i].quantile(0.75) - df[i].quantile(0.25))
    out = df[i].quantile(0.75) + out1
    out2 = df[i].quantile(0.25) - out1
    # print(i, ' = ', out1)
    print(i, " = ", df[df[i] > out][i].count())
#This cell replaces outliers with the median of each column (removing the outlier rows instead is the alternative discussed in the notes above)
#Replacing outliers with the median of that column
df['max.length_aspect_ratio'] = np.where(df['max.length_aspect_ratio'] < 3, df['max.length_aspect_ratio'].median(), df['max.length_aspect_ratio'])
for i in k:
    out1 = 1.5 * (df[i].quantile(0.75) - df[i].quantile(0.25))
    out = df[i].quantile(0.75) + out1
    out2 = df[i].quantile(0.25) - out1   # lower fence must be computed inside the loop
    df[i] = np.where(df[i] > out, df[i].median(), df[i])
    df[i] = np.where(df[i] < out2, df[i].median(), df[i])
print("This step checks that the outliers have been dealt with")
print("Checking each column's new min and max confirms the outliers were replaced")
print("Min")
for i in k:
    print(i, " = ", df[i].min())
print("\n\n")
print("Max")
for i in k:
    print(i, " = ", df[i].max())
#The outlier counts below now show zeros, confirming the outliers were removed
k=['radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']
print("NUMBER OF OUTLIERS")
for i in k:
    out1 = 1.5 * (df[i].quantile(0.75) - df[i].quantile(0.25))
    out = df[i].quantile(0.75) + out1
    print(i, " = ", df[df[i] > out][i].count())
#To check that outliers are removed or not
k=1
plt.figure(figsize=(40,40))
for i in col:
    plt.subplot(10,2,k)
    sns.boxplot(df[i], width=3)
    plt.title(i)
    k = k + 1
plt.show()
#To see each class size
print(df.groupby('class').size())
print("\nvalues in percentage\n",(df['class'].value_counts()/df['class'].count())*100)
#One class has a noticeably higher count, which would normally call for upsampling or downsampling, but it is not required here: the classes are roughly in a 1:2 ratio, which is acceptable
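If the imbalance had been severe, the minority class could have been upsampled; a minimal sketch using sklearn's resample on a hypothetical toy frame (not needed for this dataset, as noted above):

```python
import pandas as pd
from sklearn.utils import resample

# hypothetical imbalanced frame: 6 rows of class 1, 2 rows of class 2
toy = pd.DataFrame({'x': range(8), 'class': [1] * 6 + [2] * 2})
majority = toy[toy['class'] == 1]
minority = toy[toy['class'] == 2]

# sample the minority class with replacement up to the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=1)
balanced = pd.concat([majority, minority_up])
```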
df.shape #quick check that everything is OK and to see the dataset size
#To see the correlation between pairs of independent columns and between each independent column and the target
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")
plt.show()
After inspecting the heatmap and building models with various columns, I decided to drop 'scaled_radius_of_gyration': it has very low correlation and does not contribute much.
X,y=df.drop(labels=['class','scaled_radius_of_gyration'],axis=1).values,df[['class']].values
#after seeing and performing model with various column I have decided to drop 'scaled_radius_of_gyration' as it has very less corr
df.columns
#Scaling data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X2 = sc.fit_transform(X)
#Splitting data into training and test sets with an 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.2, random_state = 1)
X_train.shape
#This is SVC with normal scaled data
clf = SVC(random_state=1)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
y_pred=clf.predict(X_test)
confusion_matrix(y_test,y_pred)
print(clf.score(X_train,y_train))
print(clf.score(X_test,y_test))
print("accuracy : {0:.4f}".format(accuracy_score(y_test,y_pred)))
print(metrics.classification_report(y_test,y_pred))
#CROSS VALIDATION of 10 folds
res = cross_val_score(clf, X2, y, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
#Split the features into two parts, X and X2. X contains the features that show some correlation with other features (above 5%).
#The features in X2 are uncorrelated, so applying PCA to them is not a good idea; PCA also assumes linear relationships.
X,y=df.drop(labels=['class','scaled_radius_of_gyration','scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2','hollows_ratio'],axis=1),df[['class']]
X2=df.drop(labels=['class','scaled_radius_of_gyration','compactness', 'circularity', 'distance_circularity', 'radius_ratio',
'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
'scaled_variance', 'scaled_variance.1'],axis=1)
X2.columns
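The manual split above can also be derived programmatically from the correlation matrix: keep a feature in X if its strongest absolute correlation with any other feature exceeds the chosen threshold. A sketch on a toy frame (the 0.5 threshold and the `f1`/`f2`/`f3` columns are for the demo only; the notebook's stated cut-off is 5%):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.randn(100)
toy = pd.DataFrame({
    'f1': a,
    'f2': 2 * a + 0.1 * rng.randn(100),  # strongly correlated with f1
    'f3': rng.randn(100),                # independent noise
})

corr = toy.corr().abs()
np.fill_diagonal(corr.values, 0)         # ignore self-correlation
threshold = 0.5
correlated = [c for c in corr.columns if corr[c].max() > threshold]
```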
#Scaling both X and X2
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
XScaled = sc.fit_transform(X)
type(XScaled)
sc2 = StandardScaler()
XScaled2 = sc2.fit_transform(X2)
type(XScaled2)
covMatrix = np.cov(XScaled,rowvar=False)
print(covMatrix)
#PCA to see the variance explained by all 12 columns
pca = PCA(n_components=None)
pca.fit(XScaled)
#printing variation explained with each column
print(pca.explained_variance_)
#printing eigenvectors across all 12 dimensions
print(pca.components_)
#variation explained in terms of percentage
print(pca.explained_variance_ratio_)
plt.bar(list(range(1,13)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
plt.step(list(range(1,13)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('eigen Value')
plt.show()
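Instead of reading the elbow off the plot, the smallest number of components reaching 95% cumulative variance can be picked programmatically. A sketch on synthetic data standing in for XScaled (200 samples, 12 features driven by 4 latent factors; the data here is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for XScaled: 12 features generated from 4 latent factors
rng = np.random.RandomState(1)
latent = rng.randn(200, 4)
Xdemo = latent @ rng.randn(4, 12) + 0.05 * rng.randn(200, 12)

pca_full = PCA(n_components=None).fit(Xdemo)
cum = np.cumsum(pca_full.explained_variance_ratio_)
n_opt = int(np.argmax(cum >= 0.95)) + 1  # smallest n explaining >= 95%
```

Applied to XScaled, this rule should match the 6 components read off the plot above.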
#Reducing the 12 columns to 6 components, as they explain more than 95% of the variation
pca3 = PCA(n_components=6)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)
Xpca3.shape
XScaled2.shape
sns.pairplot(pd.DataFrame(Xpca3));
#Concatenate 6 components with X2
names=['scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2','hollows_ratio']
Xpca3=pd.DataFrame(Xpca3)
XScaled2=pd.DataFrame(XScaled2,columns=names)
XScaled2.shape
Xpca=pd.concat([Xpca3,XScaled2],axis=1)
Xpca.shape
#Splitting pca into training and test data
X_train_pca, X_test_pca, y_train, y_test = train_test_split(Xpca, y, test_size = 0.2, random_state = 1)
#SVC using pca
clf_pca = SVC(random_state=1)
clf_pca.fit(X_train_pca, y_train)
print(clf_pca.score(X_train_pca, y_train))
y_pred_pca=clf_pca.predict(X_test_pca)
confusion_matrix(y_test,y_pred_pca)
#printing scores
print(clf_pca.score(X_train_pca,y_train))
print(clf_pca.score(X_test_pca,y_test))
print("accuracy : {0:.4f}".format(accuracy_score(y_test,y_pred_pca)))
print(metrics.classification_report(y_test,y_pred_pca))
#CROSS VALIDATION of 10 folds
res = cross_val_score(clf_pca, Xpca, y, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
Note: LDA is a supervised dimensionality-reduction technique, so it is well suited to this classification problem.
#Split the features into two parts, X and X2. X contains the features that show some correlation with other features (above 5%).
#The features in X2 are uncorrelated, so applying LDA to them is not a good idea; like PCA, LDA works with linear data.
X,y=df.drop(labels=['class','scaled_radius_of_gyration','scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2','hollows_ratio'],axis=1),df[['class']]
X2=df.drop(labels=['class','scaled_radius_of_gyration','compactness', 'circularity', 'distance_circularity', 'radius_ratio',
'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
'scaled_variance', 'scaled_variance.1'],axis=1)
X2.columns
#Scaling both X and X2
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
XScaled = sc.fit_transform(X)
sc2 = StandardScaler()
XScaled2 = sc2.fit_transform(X2)
#transform to 2 components using LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(XScaled, y.values.ravel())  # fit on the scaled features; y flattened to a 1-d array
#pairplot of converted features
sns.pairplot(pd.DataFrame(X_train_lda))
#Concatenate those lda components with X2
names=['scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2','hollows_ratio']
lda=pd.DataFrame(X_train_lda)
XScaled2=pd.DataFrame(XScaled2,columns=names)
print(XScaled2.shape)
lda=pd.concat([lda,XScaled2],axis=1)
#Total shape of dataset
lda.shape
#Splitting the LDA dataset into training and test data
X_train_lda, X_test_lda, y_train, y_test = train_test_split(lda, y, test_size = 0.2, random_state = 1)
#SVC using LDA data
clf_lda = SVC(random_state=1)
clf_lda.fit(X_train_lda, y_train)
print(clf_lda.score(X_train_lda, y_train))
y_pred_lda=clf_lda.predict(X_test_lda)
confusion_matrix(y_test,y_pred_lda)
print(clf_lda.score(X_train_lda, y_train))
print(clf_lda.score(X_test_lda,y_test))
print("accuracy : {0:.4f}".format(accuracy_score(y_test,y_pred_lda)))
print(metrics.classification_report(y_test,y_pred_lda))
#CROSS VALIDATION of 10 folds
res = cross_val_score(clf_lda, lda, y, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
Note: this is just a practice step, as KPCA is an extension of PCA intended for non-linear distributions.
Note: this KPCA model is unlikely to perform well; from the multivariate analysis I could guess that the data is largely linear, and KPCA targets non-linear data.
X,y=df.drop(labels=['class','scaled_radius_of_gyration'],axis=1).values,df[['class']].values
#Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
#scaling of data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying Kernel PCA with kernel=rbf
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 10, kernel = 'rbf')
X_train_kpca = kpca.fit_transform(X_train)
X_test_kpca = kpca.transform(X_test)
clf_kpca = SVC(random_state=1)
clf_kpca.fit(X_train_kpca, y_train)
print(clf_kpca.score(X_train_kpca, y_train))
y_pred_kpca=clf_kpca.predict(X_test_kpca)
confusion_matrix(y_test,y_pred_kpca)
print("accuracy : {0:.4f}".format(accuracy_score(y_test,y_pred_kpca)))
print(metrics.classification_report(y_test,y_pred_kpca))
#CROSS VALIDATION of 10 folds
res = cross_val_score(clf_kpca, X_train_kpca, y_train, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
As expected, this model does not work as well as LDA and PCA, because the data is linear.
#To see which all parameter can we change
clf_pca.get_params()
#Here we provide lists of candidate values; the algorithm runs with each combination, selecting one value per parameter in every iteration, and the best hyper-parameters are reported in the next cell
params = [{'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear']},
          {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['rbf']}]
grid_search_cv = GridSearchCV(clf_pca,
                              params,
                              n_jobs=-1,
                              verbose=1)
grid_search_cv.fit(Xpca, y)
grid_search_cv.best_estimator_
#Best c and kernel
grid_search_cv.best_params_
#SVC using pca
clf_pca = SVC(random_state=1,C=0.5,kernel='rbf')
clf_pca.fit(X_train_pca, y_train)
print(clf_pca.score(X_train_pca, y_train))
y_pred_pca=clf_pca.predict(X_test_pca)
confusion_matrix(y_test,y_pred_pca)
#printing scores
print(clf_pca.score(X_train_pca,y_train))
print(clf_pca.score(X_test_pca,y_test))
print("accuracy : {0:.4f}".format(accuracy_score(y_test,y_pred_pca)))
print(metrics.classification_report(y_test,y_pred_pca))
#CROSS VALIDATION of 10 folds
res = cross_val_score(clf_pca, Xpca, y, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
#This does not improve performance; other hyper-parameters not included in the grid may also affect the output
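One parameter worth adding to the grid is `gamma`, which controls the width of the rbf kernel. A hedged sketch of a wider search, run here on synthetic data so it stands alone (in the notebook you would pass Xpca and y instead; the value lists are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic 3-class stand-in for the vehicle data
Xd, yd = make_classification(n_samples=200, n_features=10, n_informative=6,
                             n_classes=3, random_state=1)

params = {'C': [0.1, 1, 10],
          'gamma': ['scale', 0.01, 0.1],
          'kernel': ['rbf']}
search = GridSearchCV(SVC(random_state=1), params, cv=5, n_jobs=-1)
search.fit(Xd, yd)
best = search.best_params_
```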
On the original data we got 95.88% accuracy with a CV score of 95.51% (std 1.95). After applying PCA and reducing the data to 11 columns, we got 95.29% accuracy with a CV score of 95.75% (std 2.63), which I consider good since 7 features were removed, with the added benefit of speed from the reduced dataset. For this kind of problem, however, I would choose LDA: it is a supervised technique, it leaves the fewest features (7), and it gave 95.88% accuracy with a CV score of 94.46% (std 3.32). After LDA, my second choice would be PCA.